Submodularity-inspired Data Selection for Goal-oriented Chatbot Training based on Sentence Embeddings
Authors
Abstract
Goal-oriented (GO) dialogue systems rely on an initial natural language understanding (NLU) module to determine the user's intention and its parameters, also known as slots. Since these systems, also known as bots, help users solve problems in relatively narrow domains, they require training data from within those domains. This leads to significant data availability issues that inhibit the development of successful bots. To alleviate this problem, we propose a data selection technique for the low-data regime that allows training with significantly fewer labeled sentences, and thus lower labeling costs. We create a submodularity-inspired data ranking function, the ratio penalty marginal gain, to select data points to label based solely on information extracted from the textual embedding space. We show that distances in the embedding space are a viable source of information for data selection. This method outperforms several known active learning techniques without using label information, allowing cost-efficient training of NLU units for goal-oriented bots. Moreover, our proposed selection technique does not require retraining the model between selection steps, making it time-efficient as well.
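As a rough illustration of the kind of selection loop the abstract describes, the sketch below greedily picks sentences to label using only pairwise similarities between sentence embeddings. It is not the authors' implementation: the exact form of the ratio penalty marginal gain is assumed here (coverage of the still-unselected pool divided by a penalty for similarity to the already-selected set), and the function name `select_points` and the use of cosine similarity are illustrative choices.

```python
import numpy as np

def select_points(embeddings, budget):
    """Greedily pick `budget` sentences to label, using only embedding-space
    similarities (no labels, no model retraining between selection steps).

    The marginal gain is an assumed ratio-penalty form: coverage of the
    still-unselected pool divided by (1 + redundancy with the selected set).
    """
    n = embeddings.shape[0]
    # Cosine similarities between all pairs of sentence embeddings.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T

    selected, remaining = [], list(range(n))
    for _ in range(min(budget, n)):
        best_idx, best_gain = None, -np.inf
        for i in remaining:
            coverage = sim[i, remaining].sum()    # how well i represents the pool
            redundancy = sim[i, selected].sum() if selected else 0.0
            gain = coverage / (1.0 + redundancy)  # assumed ratio-penalty gain
            if gain > best_gain:
                best_idx, best_gain = i, gain
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected
```

Given an (n, d) matrix X of sentence embeddings, select_points(X, 100) would return the indices of 100 sentences to send for labeling; because the gain depends only on embedding distances, no model retraining is needed between steps.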
Similar resources
Summarization Based on Embedding Distributions
In this study, we consider a summarization method using the document level similarity based on embeddings, or distributed representations of words, where we assume that an embedding of each word can represent its “meaning.” We formalize our task as the problem of maximizing a submodular function defined by the negative summation of the nearest neighbors’ distances on embedding distributions, ea...
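Read literally, the objective sketched in this blurb (with V the set of words in the document, S the candidate summary, e_w the embedding of word w, and d a distance in the embedding space; this notation is assumed here rather than taken from the cited paper) is roughly

f(S) = -\sum_{w \in V} \min_{s \in S} d(e_w, e_s),

a facility-location-style function that is monotone and submodular, so it can be maximized greedily under a length budget with the usual approximation guarantee.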
Goal-Oriented Chatbot Dialog Management Bootstrapping with Transfer Learning
Goal-Oriented (GO) Dialogue Systems, colloquially known as goal oriented chatbots, help users achieve a predefined goal (e.g. book a movie ticket) within a closed domain. A first step is to understand the user’s goal by using natural language understanding techniques. Once the goal is known, the bot must manage a dialogue to achieve that goal, which is conducted with respect to a learnt policy....
Revisiting Recurrent Networks for Paraphrastic Sentence Embeddings
We consider the problem of learning general-purpose, paraphrastic sentence embeddings, revisiting the setting of Wieting et al. (2016b). While they found LSTM recurrent networks to underperform word averaging, we present several developments that together produce the opposite conclusion. These include training on sentence pairs rather than phrase pairs, averaging states to represent sequences, ...
Not All Neural Embeddings are Born Equal
Neural language models learn word representations that capture rich linguistic and conceptual information. Here we investigate the embeddings learned by neural machine translation models. We show that translation-based embeddings outperform those learned by cutting-edge monolingual models at single-language tasks requiring knowledge of conceptual similarity and/or syntactic role. The findings s...
Convolutional Sentence Kernel from Word Embeddings for Short Text Categorization
This paper introduces a convolutional sentence kernel based on word embeddings. Our kernel overcomes the sparsity issue that arises when classifying short documents or in case of little training data. Experiments on six sentence datasets showed statistically significant higher accuracy over the standard linear kernel with ngram features and other proposed models.
Journal: CoRR
Volume: abs/1802.00757
Pages: -
Publication date: 2018